Sparse Single-Index Model 



Pierre Alquiei]^ 
LPMAQ 

Universite Paris Diderot - Paris VII 
Boite 188, 175 rue du Chevaleret 
75013 Paris, France 

CREST-LS 
3 avenue Pierre Larousse 
92240 Malakoff, France 

alquier@math.jussieu.fr 

Gerard Biau 

LSTA & LPM7^E] 
Universite Pierre et Marie Curie - Paris VI 
Boite 158, Tour 15-25, 2eme etage 
4 place Jussieu, 75252 Paris Cedex 05, France 

Ecole Normale Superieure 
45 rue d'Ulm 
75230 Paris Cedex 05, France 

gerard.biauSupmc . f r 



Abstract 

Let (X, Y) be a random pair taking values in MP x M. In the so- 
called single-index model, one has Y = f*{6*^'K) + W, where /* is 
an unknown univariate measurable function, 9* is an unknown vec- 
tor in M'^, and W denotes a random noise satisfying E[iy|X] = 0. 
The single-index model is known to offer a flexible way to model a 
variety of high-dimensional real-world phenomena. However, despite 
its relative simplicity, this dimension reduction scheme is faced with 
severe complications as soon as the underlying dimension becomes 
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larger than the number of observations ( "p larger than n" paradigm) . 
To circumvent this difficulty, we consider the single-index model es- 
timation problem from a sparsity perspective using a PAC-Bayesian 
approach. On the theoretical side, we offer a sharp oracle inequal- 
ity, which is more powerful than the best known oracle inequalities 
for other common procedures of single-index recovery. The proposed 
method is implemented by means of the reversible jump Markov chain 
Monte Carlo technique and its performance is compared with that of 
standard procedures. 

Index Terms — Single-index model, sparsity, regression estimation, 
PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte 
Carlo method. 

2010 Mathematics Subject Classification: 62G08, 62G05, 62G20. 

1 Introduction 

Let Vn = {(Xi, Yi), . . . , (X„, Yn)} be a collection of independent observa- 
tions, distributed as a generic independent pair (X, Y) taking values in R*' xM 
and satisfying KY'^ < oo. Throughout, we let P be the distribution of (X, y), 
so that the sample D„ is distributed according to P®". In the regression func- 
tion estimation problem, the goal is to use the data P„ in order to construct 
an estimate r„ : — )■ M of the regression function r(x) = E[y|X = x]. In 
the classical parametric linear model, one assumes 

Y = e*^x + w, 

where 9* - (^*, ...,6;^ e MP and E[W\X.] = 0. Here 

is a linear function of the components of x = {xi, . . . , Xp)^. More generally, 
we may define 

Y ^ f*{e*^X) + W, (1.1) 

where /* is an unknown univariate measurable function. This is the cele- 
brated single-index model, which is recognized as a particularly useful vari- 
ation of the linear formulation and can easily be interpreted: The model 
changes only in the direction 6*, and the way it changes in this direction 
is described by the function /*. This model has applications to a variety 
of fields, such as discrete choice analysis in econometrics and dose response 
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models in biometrics, where high-dimensional regression models are often 
employed. There are too many references to be included here, but the mono- 
graphs of McCuUagh and Nelder [33] and Horowitz |27j together with the 
references [251 EHl EU [TBI [29] will provide the reader with good introductions 
to the general subject area. 

One of the main advantages of the single-index model is its supposed ability 
to deal with the problem of high dimension (Bellman [6]). It is known that 
estimating the regression function is especially difficult whenever the dimen- 
sion p of X becomes large. As a matter of fact, the optimal mean square 
convergence rate n-'^'^/C^'^+p) for the estimation of a fc-times differentiable re- 
gression function converges to zero dramatically slowly if the dimension p is 
large compared to k. This leads to an unsatisfactory accuracy of estimation 
for moderate sample sizes, and one possibility to circumvent this problem 
is to impose additional assumptions on the regression function. Thus, in 
particular, if r(x) = f*{6*^x.) holds for every x G MP, then the underlying 
structural dimension of the model is 1 (instead of p) and the estimation of r 
can hopefully be performed easier. In this regard, Gai'ffas and Lecue show in 
|22j that the optimal rate of convergence over the single-index model class is 
^-2fe/(2fc+i) (^ijig|;ga(j of n~'^^/^'^'^^P^), thereby answering a conjecture of Stone 

Nevertheless, practical estimation of the link function /* and the index 9*^ 
still requires a degree of statistical smoothing. Perhaps the most common 
approach to reach this goal is to use a nonparametric smoother (for instance, 
a kernel or a local polynomial method) to construct an approximation /„ of 
/*, then substitute into an empirical version Rn{0) of the mean square 
error R{6) = K[Y — f{9'^X.)]'^, and finally choose 9n to minimize RniO) (see 
e.g. Hardle, Hall and Ichimura [25] and Delecroix, Hristache and Patilea [21] 
where the procedure is discussed in detail). The rationale behind this type 
of two-stage approach, which is asymptotic in spirit, is that it produces a 
i/n-consistent estimate of 9, thereby devolving the difficulty to the simpler 
problem of computing a good estimate for the one-dimensional function /*. 
However, the relative simplicity of this strategy is accompanied by severe 
difficulties (overfitting) when the dimension p becomes larger than the num- 
ber of observations n. Estimation in this setting (called "p larger than ra" 
paradigm) is generally acknowledged as an important challenge in contempo- 
rary statistics, see e.g. the recent monograph of Biihlmann and van de Geer 
[9]. In fact, this drawback considerably reduces the ability of the single-index 
model to behave as an effective dimension reduction technique. 

On the other hand, there is empirical evidence that many signals in high- 
dimensional spaces admit a sparse representation. As an example, wavelet 
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coefficients of images often exhibit exponential decay, and a relatively small 
subset of all wavelet coefficients allow for a good approximation of the orig- 
inal image. Such signals have few nonzero coefficients and can therefore be 
described as sparse in the signal domain (see for instance [8]). Similarly, 
recent advances in high-throughput technologies — such as array compara- 
tive genomic hybridization — indicate that, despite the huge dimensionality 
of problems, only a small number of genes may play a role in determining 
the outcome and be required to create good predictors ([13] for instance). 
Sparse estimation is playing an increasingly important role in the statistics 
and machine learning communities, and several methods have recently been 
developed in both fields, which rely upon the notion of sparsity (e.g. penalty 
methods like the Lasso and Dantzig selector, see [HI [TTl [TOl [7j and the ref- 
erences therein). 



In the present document, we consider the single-index model (1.1) from a 
sparsity perspective, i.e., we assume that 9* has only a few coordinates differ- 
ent from 0. In the dimension reduction scenario we have in mind, the ambient 
dimension p can be very large, much larger than the sample size n, but we 
believe that the representation is sparse, i.e., that very few coordinates of 9* 
are nonzero. This assumption is helpful at least for two reasons: If p is large 
and the number of nonzero coordinates is small enough, then the model is 
easier to interpret and its efficient estimation becomes possible. Our setting 
is close in spirit of the approach of Cohen, Daubechies, DeVore, Kerkyachar- 
ian and Picard [16] , who study approximation from queries of functions of the 
form f{9'^x.), where 9 is approximately sparse (in the sense that it belongs 
to a weak-^p space). However, these authors dot not provide any statistical 
study of their model. Our modus operandi will rather rely on the so-called 
PAC-Bayesian approach, originally developed in the classification context by 
Shawe- Taylor and Williamson [39], McAUester [32] and Catoni [121 [13]. This 
strategy was further investigated for regression by Audibert [1] and Alquier 
[T] and, more recently, worked out in the sparsity framework by Dalalyan 
and Tsybakov EO] and Alquier and Lounici [2j. The main message of 
[T9l 1201 E] is that aggregation with a properly chosen prior is able to deal 
nicely with the sparsity issue. Contrary to procedures such as the Lasso, the 
Dantzig selector and other penalized least square methods, which achieve fast 
rates under rather restrictive assumptions on the Gram matrix associated to 
the predictors, PAC-Bayesian aggregation requires only minimal assumptions 
on the model. Besides, it is computationally feasible even for a large p and 
exhibits good statistical performance. 

The paper is organized as follows. In Section [2| we ffist set out some notation 
and introduce the single-index estimation procedure. Then we state our main 
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result (Theorem 2.1), which offers a sparsity oracle inequality more power- 
ful than the best known oracle inequalities for other common procedures of 
single-index recovery. Section |3] is devoted to the practical implementation 
of the estimate via a reversible jump Markov chain Monte Carlo (MCMC) 
algorithm, and to numerical experiments on both simulated and real-life data 
sets. In order to preserve clarity, proofs have been postponed to Section |4] 
and the description of the MCMC method in its full length is given in the 
Appendix Section |5j 

Note finally that our techniques extend to the case of multiple-index models, 
of the form 

F = r(0t^X,...,CX) + W^, 

where the underlying structural dimension m is supposed to be larger than 

1 but substantially smaller than p. However, to keep things simple, we let 
m = 1 and leave the reader the opportunity to adapt the results to the more 
general situation m > 1. 

2 Sparse single-index estimation 

We start this section with some notation and basic requirements. 
2.1 Notation 

Throughout the document, we suppose that the recorded data Vn is gen- 



erated according to the single-index model (1.1). More precisely, for each 
z = 1, . . . ,n, 

Y, = r(r^x,) + w,, 

where /* is a univariate measurable function, 6* is a p-variate vector, and 
Wi, . . . , Wn are independent copies of W. We emphasize that it is implicitly 
assumed that the observations are drawn according to the true model under 
study. Therefore, this casts our problem in a nonparametric setting rather 
than in a classical PAC-Bayesian one. 



Recall that, in model ( pLlj ), E[iy|X] = and, consequently, that EW = 0. 
However, the distribution of W (in particular, the variance) may depend on 
X. We shall not precisely specify this dependence, and will rather require 
the following condition on the distribution of W. 

Assumption N. There exist two positive constants a and L such that, for 
all integers k > 2, 

E[\W\'\X] < -a^L^-\ 
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Observe that Assumption N holds in particular ii W — $(X)e, where e is 
a standard Gaussian random variable independent of X and $(X) is almost 
surely bounded. 

Let ll^lli denote the £i-norm of the vector 9 = {9i, . . . ,9p)'^ , i.e., ||^||i = 
Yl^=i Without loss of generality, it will be assumed throughout the 
document that the index 9* belongs to Sf_^_, where Sf_^ is the set of all 
9 with ll^lli = 1 and 9j(e) > 0, where j{9) is the smallest j G {1, . . . ,p} 
such that 9j 7^ 0. 

We will also require that the random variable X is almost surely bounded 
by a constant which, without loss of generality, can be taken equal to 1. 
Moreover, it will also be assumed that the link function f* is bounded by 
some known positive constant C. Thus, denoting by ||X||oo the supremum 
norm of X and by ||/*||oo the functional supremum norm of f* over [—1,1], 
we set: 

Assumption B. The condition ||X||oo < 1 holds almost surely and there 
exists a positive constant C larger than 1 such that ||/*||oo < C- 

Remark 2.1 To keep a sufficient degree of clarity, no attempt was made 

to optimize the constants. In particular, the requirement C > 1 is purely 
technical. It can easily be removed by replacing C by max[C, 1] throughout 
the document. 

In order to approximate the link function /*, we shall use the vector space 
spanned by a given countable dictionary of measurable functions {(fijj'jLi- 
Put differently, the approximation space T is the set of (finite) linear com- 
binations of functions of the dictionary. Each (pj of the collection is assumed 
to be defined on [—1, 1] and to take values in [—1, 1]. To avoid getting into 
too much technicalities, we will also assume that each is different iable 
and such that, for some positive constant £, \\^'j\\oo < ^j- This assumption is 
satisfied by the (non-normalized) trigonometric system 

(pi{t) = 1, ip2jit) = cos(7rjt), (P2j+iit) = sm{-Kjt), j = 1, 2, . . . 

Finally, for any measurable f : W ^ M. and 9 e »5f we let 

R(9,f)^E[{Y-f{9^X))'] 

and denote by 

1 

Rn{ej)^-Y.{Y,-f{9^y..)f 

1=1 

the empirical counterpart of R{9, f) based on the sample X>„. 



6 



2.2 Estimation procedure 

We are now in a position to describe our estimation procedure. The method 
which is presented here is inspired by the approach developed by Catoni 
in [121 US]- It strongly relies on the choice of a probability measure vr on 
iSf X J^, called the prior, which in our framework should enforce the sparsity 
properties of the target regression function. With this objective in mind, we 
first let 

d7r(0,/)=dMW/), 

i.e., we assume that the distribution over the indexes is independent of the 
distribution over the link functions. With respect to the parameter 6*, we put 

p / \ -1 



10- (•) 

Tr- !^ ^\ \ rl— „■ \ / 



MO) = - — ""7'"'rr; . (2.1) 

where |/| denotes the cardinality of / and dfii{6) is the uniform probability 
measure on the set 

Sl^{I) = {9 = {ei,...,ep) e 5f + ■■0j = O if and only if j ^ I}. 

We see that iSf ^(/) may be interpreted as the set of "active" coordinates in 
the single-index regression of F on X, and note that the prior on iSf _|. is a 
convex combination of uniform probability measures on the subsets iSf _,_(/). 
The weights of this combination depend only on the size of the active coordi- 
nate subset /. As such, the value \I\ characterizes the sparsity of the model: 
The smaller |J|, the smaller the number of variables involved in the model. 
The factor 10~* penalizes models of high dimension, in accordance with the 
sparsity idea. 

The choice of the prior on J-" is more involved. To begin with, we define, 
for any positive integer M < n and all A > 0, 

{M 
(/3i, . . . , Pm) e : < A and (3m + 

Next, we let Fm{.^) C -7-" be the image of Bm{,^) by the map 

(/5i,...,/3m) ^ ^f=iPj'Pj- 

It is worth pointing out that, roughly, Sobolev spaces are well approximated 
by J^m{^) as M grows (more on this in Subsection 2.3). Finally, we define 
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^M^df) on the set J-a/(C+1) as the image of the uniform measure on Bm{C+ 
1) induced by the map ^m, and take 



J2 lO-''duMif) 



Mf) = • (2.2) 

Some comments are in order here. First, we note that the prior tt is defined 
on Sf_^_ X J>j(C + 1) endowed with its canonical Borel cx-field. The choice 
of C + 1 instead of C in the definition of the prior support is essentially 
technical. This bound ensures that when the target /* belongs to J-'„(C), 
then a small ball around it is contained in J-'„(C + 1). It could be safely 
replaced by C + where {wnj^i is any positive sequence vanishing suf- 
ficiently slowly as n — )• oo. Next, the integer M should be interpreted as 
a measure of the "dimension" of the function / — the larger M, the more 
complex the function — and the prior u adapts again to the sparsity idea by 
penalizing large-dimensional functions /. The coefficients 10~* and 10~*^ 



which appear in (2.1) and (2.2) show that more complex models have a 
geometrically decreasing influence. Note however that the value 10, which 
has been chosen because of its good practical results, is somehow arbitrary. 
It could be, in all generality, replaced by a more general coefficient a at 
the price of a more technical analysis. Finally, we observe that, for each 

M 



oo 



<Y.\(^,\<c+i. 



Now, let A be a positive real number, called the inverse temperature param- 
eter hereafter. The estimates 6x and fx of 6* and /*, respectively, are simply 
obtained by randomly drawing 



7A, jx 



fx) ~ Pa, 



where p\ is the so-called Gibbs posterior distribution over 5f ^ x J^„(C + 1) 
defined by the probabihty density 

'^P^_(^Q f-^^ exp[-Ai?„(0,/)] 



dyr 



exp [-XRr,{9,f)]d7r{9,f) 



[The notation dpx/dTr means the density of px with respect to vr.] The es- 
timate {9x, fx) has a simple interpretation. Firstly, the level of significance 
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of each pair [9, /) is assessed via its least square error performance on the 
data P„. Secondly, a Gibbs distribution with respect to the prior tt enforcing 
those pairs (6', /) with the most empirical significance is assigned on the space 
iSf X Tn[C + 1). Finally, the resulting estimate is just a random realization 
(conditional to the data) of this Gibbs posterior distribution. 

2.3 Sparsity oracle inequality 

For any / C {1, . . . ,p} and any positive integer M < n, we set 

^ ' ' ' (e,/)e5f,+(/)xj-M(c) 

At this stage, it is very important to note that, for each M, the infimum //^ 
is defined on J-'m(C), whereas the prior charges a slightly bigger set, namely 
^m(C + 1). 

The main result of the paper is the following theorem. Here and everywhere, 
the wording "with probability 1 — 5" means the probability evaluated with 
respect to the distribution P®"^ of the data Vn and the conditional probability 
measure p\. Recall that £ is a positive constant such that ||</3j||oo < ^j- 

Theorem 2.1 Assume that Assumption N and Assumption B hold. Set 
w = 8{2C + 1) max[L, 2C + 1] and take 

^ " w + 2[{2C + IY + Aa^\ ^^'^^ 
Then, for all 5 G ]0, 1[, with probability at least 1 — 6 we have 



Riexjx)-Rie\n<E inf \Riei^,ji^,)-Rie\n 

IC{l,...,p} 
1 < M < n 

Mlog(C?2) + |/| \og{pn) + log (I) 
n 

where E is a positive constant, function of L, C , a and C. only. 

Remark 2.2 Interestingly enough, analysis of the estimate {Ox,f\) is still 
possible when Assumption N is not satisfied. Indeed, even if Bernstein's in- 



equality (see Lemma 4-1) is not valid, a recent paper by Seldin, Cesa-Bianchi, 
Laviolette, Auer, Shawe- Taylor and Peters fSE^ provides us with a nice alter- 
native inequality assuming less restrictive assumptions. However, we would 
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then suffer a loss in the upper hound of Theorem \2.1\ It is also interest- 
ing to note that recent results by Audibert and Catoni allow the study 
of PAC-Bayesian estimates without Assumption N. However, the results of 
these authors are valid for linear models only, and it is therefore not clear to 
what extent their technique can be transposed to our setting. 



Theorem can be given a simple interpretation. Indeed, we see that if there 
is a "small" I and a "small" M such that R{9'^ j^.^, fj^) is close to R{9*, /*), 

then R{9\, fx) is also close to R{0*, /*) up to terms of order 1/n. However, if 
no such / or M exists, then one of the terms Mlog{Cn)/n and |/| \og{pn)/n 
starts to dominate, thereby deteriorating the general quality of the bound. A 
good approximation with a "small" / is typically possible when 6* is sparse 
or, at least, when it can be approximated by a sparse parameter. On the 
other hand, a good approximation with a "small" M is possible if /* has a 
sufficient degree of regularity. 

To illustrate the latter remark, assume for instance that {ipj}'^^ is the (non- 
normalized) trigonometric system and suppose that the target /* belongs to 
the Sobolev ellipsoid, defined by 



for some unknown regularity parameter k > 2 (see, e.g., Tsybakov |12])- 
Observe that, in this context, the approximation sets Fm{C + 1) take the 
form 

^m(C+1) 

MM ^ 

f e L2([-l, l]):f = Y, 5^ < C + 1 and /3m 7^ I . 

i=i i=i J 

It is important to note that the regularity parameter k is assumed to be 
unknown, and this casts our results in the so-called adaptive setting. The 
following additional assumption will be needed: 

Assumption D. The random variable 6'*"^X has a probability density on 
[—1,1], bounded from above by a positive constant B. 

Last, we let J* be the set I such that 9* G 5f _,_(/) and set ||^*||o = 

Corollary 2.1 Assume that Assumption N , Assumption^ and Assumption 
D hold. Suppose also that f* belongs to the Sobolev ellipsoid W(A;, GC^/tt^), 
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where the real number k > 2 is an (unknown) regularity parameter. Set 
w = 8{2C + 1) max[L, 2C + 1] and take X as zn (2^ . Then, for all 5 e]0,l [, 
with probability at least 1 ~ 6 we have 




Ik 

\og{Cn)\^^ , ll^lloMH ^ Mil . (24) 



n J n n 



where S' is a positive constant, function of L, C , a , I and B only. 

As far as we are aware, all existing methods achieving rates of convergence 



similar to the ones provided by Corollary |2.1| are valid in an asymptotic 
setting only (p fixed and n — )■ oo). The strength of Corollary 2.1 is to 
provide a finite sample bound and to show that our estimate still behaves 
well in a nonasymptotic situation if the intrinsic dimension (i.e., the sparsity) 
is small with respect to n. To understand this remark, just assume that p is 
a function of n such that p — )■ oo as n — )■ oo. Whereas a classical asymptotic 
approach cannot say anything clever about this situation, our bounds still 
provide some information, provided the model is sparse enough (i.e., ||^*||o 
is sufficiently small with respect to n). 

We see that, asymptotically [ p fix ed and n — )■ oo), the leading term on the 



right-hand side of inequality (2.4) is (log(?T,)/n) 2'=+i . This is the minimax 
rate of convergence over a Sobolev class, up to a log(n) factor. However, 
when n is "small" and 6* is not sparse (i.e., ||^*||o is not "small"), the term 
||^^*||o \og{pn)/n starts to emerge and cannot be neglected. Put differently, in 
large dimension, the estimation of 9* itself is a problem — this phenomenon 
is not taken into account by asymptotic studies. 

It is worth mentioning that the approach developed in the present article does 
not offer any guarantee on the point of view of variable (feature) selection. To 
reach this objective, an interesting route to follow is the sufficient dimension 
reduction (SDR) method proposed by Chen, Zou and Cook [15], which can 
be applied to the single-index model to estimate consistently the parameter 
9* and perform variable selection in a sparsity framework. Note however that 
such results require strong assumptions on the distribution of the data. 



Finally, it should be stressed that the choice of A in Theorem 2.1 and Corol- 



lary 2.1 is not the best possible and may eventually be improved, at the price 



of a more technical analysis however. 
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3 Implementation and numerical results 



A series of experiments was conducted, both on simulated and real-life data 
sets, in order to assess the practical capabilities of the proposed method and 
compare its performance with that of standard procedures. Prior to analysis, 
we first need to discuss its concrete implementation, which has been carried 
out via a Markov Chain Monte Carlo (MCMC) method. 

3.1 Implementation via reversible jump MCMC 

The use of MCMC methods has become a popular way to compute Bayesian 
estimates. For an introduction to the domain, one should refer to the compre- 
hensive monograph of Marin and Robert [20] and the references therein. Im- 
portantly, in this computational framework, an adaptation of the well-known 
Hastings-Metropolis algorithm to the case where the posterior distribution 
gives mass to several models of different dimensions was proposed by Green 
[23] under the name Reversible Jump MCMC (RJMCMC) method. In the 
PAC-Bayesian setting, MCMC procedures were first considered by Catoni 
[12j . whereas Dalalyan and Tsybakov [191 EQ] and Alquier and Lounici [2] 
explore their practical implementation in the sparse context using Langevin 
Monte Carlo and RJMCMC, respectively. Regarding the single-index model, 
MCMC algorithms were used to compute Bayesian estimates by Antoniadis, 
Gregoire and McKeague [3] and, more recently, by Wang [H], who develop 
a fully Bayesian method to analyse the single-index model. Our implemen- 
tation technique is close in spirit to the one of Wang [44j . 

As a starting point for the approximate computation of our estimate, we 
used the RJMCMC method of Green [23], which is in fact an adaptation of 
the Hastings-Metropolis algorithm to the case where the objective posterior 
probability distribution (here, px) assigns mass to several different models. 
The idea is to start from an initial given pair {9^'^\ f^^^) G 5f x J>^(C + 1) 
and then, at each step, to iteratively compute (6'^*"'"^'',/*^*+^)) from {0^*\ f^^^) 
via the following chain of rules: 

• Sample a random pair (r*^*\ h^^^) according to some proposal conditional 
density kt{ . 1(6*^*^ /'•*■')) with respect to the prior vr; 



• Take 




with probability at 
with probability 1 — at, 
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where 

. A ^(rW,/.W)xfc,((0W,/W)|(rW,/.W))\ 

at = mm 1, . 

V ^(^W,/«)x A;, ((rW,/.W)|(0W, /(*))); 

This protocol ensmes that the sequence /'•*•' )}^o ^ Markov chain 

with invariant probabihty distribution px (see e.g. Marin and Robert |30]). 
A usual choice is to take kf = k, so that the Markov chain is homogeneous. 
However, in our context, it is more convenient to let kt = ki if t is odd and 
kf = ^2 if t is even. Roughly, the effect of ki is to modify the index 9^^^ while 
^2 will essentially act on the link function f^*\ While the ideas underlying 
the proposal densities ki and k2 are quite simple, a precise description in 
its full length turns out to be more technical. Thus, in order to preserve 
the readability of the paper, the explicit construction of ki and k2 has been 
postponed to the Appendix Section [5j 

From a theoretical point of view, it is clear that the implementation of our 
method requires knowledge of the constant C (the upper bound on ||/*||oo)- 
A too small C will result in a smaller model, which is unable to perform a 
good approximation. On the other hand, a larger C induces a poor bound 
in Theorem 2.1. In practice, however, the influence of C turns out to be 
secondary compared to the impact of the parameter A. Indeed, it was found 
empirically that a very large choice of C (e.g., C = 10^°°) does not deteriorate 
the overall quality of the results, as soon as A is appropriately chosen. This 
is the approach that was followed in the experimental testing process. 

Besides, the time for the Markov chains to converge depends strongly on 
the ambient dimension p and the starting point of the simulations. When 
the dimension is small (typically, p < 10), the chains converge fast and any 
value may be chosen as a starting point. In this case, we let the MCMC 
run 1000 steps and obtained satisfying results. On the other hand, when the 
dimension is larger (typically, p > 10), the convergence is very slow, in the 
sense that Rn{9^^\ f^^^) takes a very long time to stabilize. However, using 
as a starting point for the chains the preliminary estimate 6'jjjjj (see below) 
significantly reduces the number of steps needed to reach convergence — we 
let the chains run 5000 steps in this context. Nevertheless, as a general rule, 
we encourage the users to inspect the convergence of the chains by checking 
if Rn{9^^\ /^*^) is stabilized, and to run several chains starting from different 
points to avoid their attraction into local minima. 
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3.2 Simulation study 

In this subsection, we illustrate the finite sample performance of the presented 
estimation method on three synthetic data sets and compare its predictive 
capabilities with those of three standard statistical procedures. In all our ex- 
periments, we took as dictionary the (non-normalized) trigonometric system 
{ipj}'^i and denote accordingly the resulting regression function estimate 
defined in Section [2] by -Ppourier- accordance with the order of magnitude 
indicated by the theoretical results, we set A = 4n. This choice can undoubt- 
edly be improved a bit but, as the numerical results show, it seems sufficient 
for our procedure to be fairly competitive. 

The tested competing methods are the Lasso (Tibshirani [H]), the stan- 
dard regression kernel estimate (Nadaraya [Ml [35] and Watson [15] , see also 
Tsybakov [12]), and the estimation strategy discussed in Hardle, Hall and 
Ichimura [25]. While the procedure of Hardle, Hall and Ichimura is specifi- 
cally tailored for single-index models, the Lasso is designed to deal with the 
estimation of sparse linear models. On the other hand, the nonparametric 
kernel method is one of the best options when no obvious assumption (such 
as the single-index one) can be made on the shape of the targeted regression 
function. 

We briefiy recall that, for a linear model of the form Y = 6*^^ + W, the 
Lasso estimate takes the form -FLasso(^) = ^Lasso-^' '^^ere 

{-, n P 
i=i j=i 

and ^ > is a regularization parameter. Theoretical results (see e.g. Bunea, 
Tsybakov and Wegkamp [lOj) indicate that ^ should be of the order ^* = 
a^y\og{p)/n. Throughout, a is assumed to be known, and we let ^ = ^*/3, 
since this choice is known to give good practical results. The Nadaraya- 
Watson kernel estimate will be denoted by -Fnw- is defined by 



Er=ii^.(x-x,) 

for some nonnegative kernel K on W and Kh{z) = K{z/h)/h. In the ex- 
periments, we let K be the Gaussian kernel K[z) = exp[—z^z) and chose 
the smoothing parameter h via a classical leave-one-out procedure on the 
grid Q = {0.75^, k = 0, . . . , [log(n)J}, see, e.g., Gyorfi, Kohler, Krzyzak and 
Walk [2l] (notation [.J stands for the fioor function). Finally, the estimation 
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procedure advocated in Hardle, Hall and Ichimura [25J takes the form 
i^HHl(x) = 7^ 

^tiGh (^5hi(^-X.)) 
for some kernel G on M, with Gh{z) = G{z/h)/h and 

n 

(^)^HHl)^arg min 

i=l 

All calculations were performed with the Gaussian kernel. We used the grid 
Q for the optimization with respect to h, whereas the best search for 6 was 
implemented via a pathwise coordinate optimization. 

The various methods were tested for the general regression model 

Yi = F{Xi) + W,, t = l,...,n, 

for three different choices of F (single-index or not) and two values of n, 
namely n = 50 and tt, = 100. In each of these models, the observations Xj 
take values in MP, with p = 10 and p = 50, and have independent compo- 
nents uniformly distributed on [—1,1]. The noise variables Wi, . . . , Wn are 
independently distributed according to a Gaussian Af{0,cr^), with a = 0.2. 
It is worth pointing out that for n = 50 and p = 50, p and n are of the 
same order, which means that the setting is nonasymptotic. It is essentially 
in this case that the use of estimates tailored to sparsity, which reduce the 
variance, is expected to improve the performance over generalist methods. 
On the other hand, the situation n = 100 and p = 10 is less difficult and 
mimics the asymptotic setting. 

The three examined functions -F(x), for x = (xi, . . . , Xp), were the following 
ones: 

[Model 1] A linear model -FLinear(x) = 26'*^x. 

[Model 2] A single-index function Fsi(x) = 2{e''^x)'^ + 6'*^x. 

[Model 3] A purely nonparametric model F]s^p(x) = 2|a;2| A/|a;i| — Xg, 

where, in the first and second model, 6* = (0.5, 0.5, 0, . . . , 0)"^. Thus, in 
[Model 1] and [Model 2], even if the ambient dimension is large, the in- 
trinsic dimension of the model is in fact equal to 2. 



E,^^Y,G, (g^(X,-X.)) 

(r(x,-x,)) 
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n = 50 


p = 10 


-^Fourier 


-^HHI 


-^Lasso 




-^Linear 


median 


0.061 


0.063 


0.046 


0.293 




mean 
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n = 100 


p = 10 


-^Fourier 


-^HHI 


-^Lasso 




-'^Linear 


median 


0.053 


0.051 


0.042 


0.227 




mean 


0.056 


0.050 


0.043 


0.237 




b.d. 


0.011 


0.006 


0.004 


0.044 


^SI 


median 


0.047 


0.052 


0.332 


0.209 




mean 


0.049 


0.053 


0.337 


0.218 




b.d. 


0.009 


0.012 


0.063 


0.045 




median 


0.305 


0.343 


0.793 


0.333 




mean 


0.321 


0.338 


0.833 


0.324 




b.d. 


0.092 


0.042 


0.145 


0.041 



Table 1: Numerical results for the simulated data, with n = 50 and n = 100, 
p = 10. The characters in bold indicate the best performance. 



For each experiment, a learning bet of bize n was generated to compute the 
estimateb and their performance, in termb of mean bquare previbion error, wab 
evaluated on a beparate tebt bet of the bame bize. The rebultb are bhown in 
Table [T] (p = 10) and Table [2] (p = 50). As each experiment wab repeated 20 
times, thebe tableb report the median, the mean and the btandard deviation 
(b.d.) of the previbion error of each procedure. 

Some commentb are in order. Firbt, we note without burpribe that: 

1. The Labbo performb well in the linear betting [Model 1]. 

2. The bingle-index methodb -^Fourier -^HHI the bebt oneb when 
the targeted regrebbion function really involveb a bingle-index model 
[Model 2]. 

3. The kernel method giveb good rebultb in the purely nonparametric bet- 
ting [Model 3]. 
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n = 50 


p = 50 


-^Fourier 


-^HHI 


-^Lasso 




-^Linear 


median 


0.057 
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0.060 


0.507 




mean 


n not; 


110/1 


U.UDD 


U.Ooo 




b.Q. 


n 1 /I'? 

U. i^tO 


n 9/1 1 


u.uzo 


n nsi 

U.Uoi 


^SI 


median 


n nKn 
U.UbU 




U. / yo 


U.oUo 




llicdii 




U.OOcf 


n 77(S 






S.U.. 


U.Uii 


u.zuu 


U.ZUo 


n 1 no 
u. iuy 




median 


U.oDo 


U. ( oo 


1 ni n 
i.yiu 






TTl n 

llicdli 


n ^04 


n 771 

U. 1 1 -L 








b.Q. 


u.ozu 


n 1 fis 

U. IDo 


n /IRS 


n 1 ni 

U.iUi 


n = 100 


p = 50 


-^Fourier 


-^HHI 


-^Lasso 




-'^Linear 


median 


0.053 


0.092 


0.050 


0.519 




mean 


0.054 


0.100 


0.050 


0.508 




s.d. 


0.007 


0.026 


0.006 


0.026 


^SI 


median 


0.047 


0.242 


0.503 


0.329 




mean 


0.070 


0.267 


0.502 


0.339 




s.d. 


0.099 


0.111 


0.106 


0.073 




median 


0.361 


0.736 


1.968 


0.418 




mean 


0.557 


0.765 


2.045 


0.406 




s.d. 


0.519 


0.226 


0.546 


0.076 



Table 2: Numerical results for the simulated data, with n = 50 and n = 100, 
p = 50. The characters in bold indicate the best performance. 



Interestingly, -^Fourier provides slightly better results than the single-index- 
tailored estimate Fum, especially for p = 50. This observation can be easily 
explained by the fact that i^uHI does not integrate any sparsity information 
regarding the parameter 9*, whereas -fVourier ^^i^s to focus on the dimension 
of the active coordinates, which is equal to 2 in this simulation. As a general 
finding, we retain that -pFourier is the most robust of all the tested procedures. 

3.3 Real data 

The real-life data sets used in this second series of experiments are from two 
different sources. The first one, called AIR-QUALITY data (n = 111, 
p = 3), has been first used by Chambers, Cleveland, Kleiner and Tukey 
[H] and has been later considered as a benchmark in the study and com- 
parison of single-index models (see, for example, Antoniadis, Gregoire and 
McKeague [3] and Wang [H], among others). This data set originated from 
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an environmental study relating n = 111 ozone concentration measures at 
p = 3 meteorological variables, namely wind speed, temperature and radi- 
ation. The data is available as a package in the software R which we 
employed in all the numerical experiments. The programs are available upon 
request from the authors. 

The second category of data arises from the UC Irvine Machine Learning 
Repository http : / / archive . ics . uci . edu/ml, where the following packages 
have been downloaded from: 

• AUTO-MPG (Quinlan [36], n = 392, p = 7). 

• CONCRETE (Yeh 06], n = 1030, J9 = 8). 

• HOUSING (Harrison and Rubinfeld |26], n = 508, p=13). 

• SLUMP-1, SLUMP-2 and SLUMP-3, which correspond to the con- 
crete slump test data introduced by Yeh ^\ {n = 51, p = 7). Since 
there are 3 different output variables Y in the original data set, we 
created a single experiment for each of these variables (1 refers to the 
output "slump", 2 to the output "flow" and 3 to the output "28-day 
Compressive Strength"). 

• WINE-RED and WINE- WHITE (Cortez, Cerdeira, Almeida, Ma- 
tos and Reis [H], n = 1599, n = 4898, p = 11). 

We refer to the above-mentioned references for a precise description of the 
meaning of the variables involved in these data sets. For homogeneity rea- 
sons, all data were normalized to force the input variables to lie in [—1, 1] — in 
accordance with the setting of our method — and to ensure that all output 
variables have standard deviation 0.5. In two data sets (AIR-QUALITY 
and AUTO-MPG) there were some missing values and the corresponding 
observations were simply removed. 

For each method and each of the nine data sets, we randomly split the obser- 
vations in a learning and a test set of equal sizes, computed the estimate on 
the learning set, evaluated the prediction error on the test set, and repeated 
this protocol 20 times. The results are summarized in Table |3j 

We see that all the tested methods provide reasonable results on most data 
sets. The Lasso is very competitive, especially in the nonasymptotic frame- 
work. The estimation procedure -Fpourier offers outcomes which are similar 
to the ones of -Fhhl with a slight advantage for the latter method however. 
Altogether, -Fpourier ^^^1 -^hhi provide the best performance in terms of pre- 
diction error in 6 out of 9 experiments. Besides, when it is not the best, the 
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Data set 




-^Fourier 


-^HHI 


-^Lasso 




AIR QUALITY 


median 


0.117 


0.099 


0.107 


0.129 


n = 111 


mean 


0.128 


0.096 


0.113 


0.130 


p = 3 


s.d. 


0.044 


0.029 


0.029 


0.035 


AUTO-MPG 


median 


0.044 


0.049 


0.070 


0.068 


n = 392 


mean 


0.051 


0.050 


0.072 


0.069 


p = 7 


s.d. 


0.017 


0.006 


0.011 


0.009 


CONCRETE 


median 


0.089 


0.087 


0.106 


0.094 


n = 1030 


mean 


0.091 


0.087 


0.107 


0.094 


p = 8 


s.d. 


0.008 


0.003 


0.005 


0.004 


HOUSING 


median 


0.074 


0.059 


0.086 


0.086 


n = 508 


mean 


0.076 


0.061 


0.085 


0.088 


p = ll 


s.d. 


0.015 


0.013 


0.012 


0.016 


SLUMP-1 


median 


0.289 


0.171 


0.201 


0.208 


n = 51 


mean 


0.244 


0.187 


0.213 


0.226 


p = 7 


s.d. 


0.062 


0.050 


0.049 


0.047 


SLUMP-2 


median 


0.219 


0.196 


0.172 


0.215 


n = 51 


mean 


0.216 


0.194 


0.171 


0.213 


p = 7 


s.d. 


0.053 


0.025 


0.019 


0.022 


SLUMP-3 


median 


0.065 


0.070 


0.053 


0.116 


n = 51 


mean 


0.073 


0.079 


0.052 


0.126 


p = 7 


s.d. 


0.033 


0.027 


0.010 


0.026 


WINE-RED 


median 


0.173 


0.171 


0.183 


0.171 


n = 1599 


mean 


0.174 


0.170 


0.174 


0.183 


p = ll 


s.d. 


0.009 


0.008 


0.007 


0.010 


WINE- WHITE 


median 


0.191 


0.187 


0.185 


0.184 


n = 4898 


mean 


0.202 


0.188 


0.186 


0.185 


p = ll 


s.d. 


0.045 


0.003 


0.004 


0.004 



Table 3: Numerical results for the real-life data sets. The characters in bold 
indicate the best performance. 

method -^Fourier is close to the best one, as for example in SLUMP-3 and 
WINE-RED. As an illustrative example, the plot of the resulting fit of our 
procedure to the data set AUTO-MPG is shown in Figure [1} 



Clearly, all data sets under study have a dimension p which is small compared 
to n. To correct this situation, we ran the same series of experiments by 
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o I 1 I I I I I 

-0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 

Figure 1: AUTO-MPG example: Estimated link function by the method 

-^Fourier- 

adding some additional irrelevant dimensions to the data. Specifically, the 
observations were embedded into a space of dimension p x 4 by letting the 
new fake coordinates follow independent uniform [0, 1] random variables. The 
results are shown in Table |4j In this nonasymptotic framework, the method 
Fhhi — which is not designed for sparsity — collapses, whereas -Fpourier takes 
a clear advantage over its competitors. In fact, it provides the best results in 
3 out of 9 experiments (AUTO-MPG, CONCRETE and HOUSING). 
Besides, when it is not the best, the method -^Fourier is very close to the best 
one, as for example in SLUMP-3 and WINE-RED. 

Thus, as a general conclusion to this experimental section, we may say that 
our PAC-Bayesian oriented procedure has an excellent predictive ability, even 
in nonasymptotic/high-dimensional situations. It is fast, robust, and exhibits 
performance at the level of the gold standard Lasso. 
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Augmented data set 




-^Fourier 


-^HHI 


-^Lasso 




AIR QUALITY 


median 


0.172 


0.272 


0.164 


0.281 


n = 111 


mean 


0.244 


0.291 


0.163 


0.291 


p = 12 


s.d. 


0.163 


0.116 


0.038 


0.046 


AUTO-MPG 


median 


0.043 


0.062 


0.085 


0.202 


n = 392 


mean 


0.044 


0.072 


0.086 


0.203 


p = 28 


s.d. 


0.009 


0.018 


0.008 


0.014 


CONCRETE 


median 


0.087 


0.093 


0.113 


0.245 


n = 1030 


mean 


0.087 


0.094 


0.112 


0.094 


p = 32 


s.d. 


0.007 


0.008 


0.005 


0.009 


HOUSING 


median 


0.071 


0.199 


0.092 


0.226 


n = 508 


mean 


0.075 


0.181 


0.095 


0.227 


p = 44 


s.d. 


0.023 


0.084 


0.013 


0.018 


SLUMP-1 


median 


0.270 


0.426 


0.276 


0.271 


n = 51 


mean 


0.290 


0.409 


0.274 


0.262 


p = 44 


s.d. 


0.101 


0.079 


0.055 


0.042 


SLUMP-2 


median 


0.276 


0.332 


0.195 


0.253 


n = 51 


mean 


0.285 


0.349 


0.198 


0.254 


p = 44 


s.d. 


0.075 


0.063 


0.043 


0.034 


SLUMP-3 


median 


0.079 


0.371 


0.061 


0.372 


n = 51 


mean 


0.082 


0.361 


0.058 


0.279 


p = 28 


s.d. 


0.025 


0.079 


0.013 


0.031 


WINE-RED 


median 


0.178 


0.222 


0.172 


0.245 


n = 1599 


mean 


0.176 


0.226 


0.174 


0.246 


p = 44 


s.d. 


0.085 


0.033 


0.006 


0.029 


WINE- WHITE 


median 


0.199 


0.239 


0.187 


0.252 


n = 4898 


mean 


0.204 


0.256 


0.188 


0.260 


p= 11 


s.d. 


0.091 


0.041 


0.005 


0.019 



Table 4: Numerical results for the real-life data sets augmented with noise vari- 
ables. The characters in bold indicate the best performance. 

4 Proofs 

4.1 Preliminary results 

Throughout this section, we let tt be the prior probability measure on x 
J^n{C + 1) equipped with its canonical Borel a-field. Recall that TniC + 1) C 
T and that, for each / e TniC -|- 1), we have ||/||oo < C -|- 1. 
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Besides, since E[y|X] = /*(0*^X) almost surely, we note once and for all 
that for all {6, /) G 5f + x J-„(C + 1), 



R{e, /) - R{e\ n = E[Y- fie^iL)] '-E[Y-f 



X 



X 



(Pythagora's theorem). We start with four technical lemmas. Lemma 4.1 
a version of Bernstein's inequality, whose proof can be found in Massart 
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Chapter 2, inequality (2.21)]. Lemma 4.2 is a classical result, whose proof 
can be found, for example, in Catoni [131 P^i^ge 4]. For a random variable Z, 
the notation {Z)+ means the positive part of Z. 

Lemma 4.1 Let Ti, . . . , T„ be independent real-valued random variables. As- 
sume that there exist two positive constants v and w such that, for all integers 
k>2, 



i=l 



Then, for any ( e]0,1/w[. 



E 



i=l 



< exp 



<2 



2(1 -K) 



Given a measurable space {E, S) and two probability measures fii and /i2 on 
{E,S), we denote by /C(/ii,/i2) the KuUback-Leibler divergence of with 
respect to fi2, defined by 




d/ii if /ii < /i2, 
otherwise. 



(Notation fii <^ fj,2 means "/xi is absolutely continuous with respect to ■) 



Lemma 4.2 Let {E,£) be a measurable space. For any probability measure 
fi on {E, £) and any measurable function h : E -^M. such that J (exp oh)dfi < 
00, we have 

log J {expoh)dfi = sup hdm — /C(m,yu)^ , (4.1) 

where the supremum is taken over all probability measures on {E, £) and, by 
convention, cxd — 00 = —00. Moreover, as soon as h is bounded from above 
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on the support of fi, the supremum with respect to m on the right-hand side 



of (4-1) is reached for the Gibbs distribution g given by 



dg^ 
djj, 



exp [h{e)] 
(exp oh)dfi 



eeE. 



Lemma 4.3 Assume that Assumption N holds. Set w = 8(2C + 1) max[L, 
2C + 1] and take 

n 

^ J 'w7+[(2C+l)2 + 4a2] _■ 

Then, for all 5 g]0, 1[ and any data- dependent probability measure p abso- 
lutely continuous with respect to vr we have, with probability at least 1 — S, 



R{e, f) - R{e\ r 
1 

< 



A[(2C+l)2+4(72] 



1 



Rn{ej)~Rn{e\f'') + 



n—wX 



log ( f(^,/))+logQ) 



where the pair {9, /) is distributed according to p. 

Fix 9 E Sf^ and / G J-'„(C + 1). The proof starts 
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Proof of Lemma 



with an apphcation of 



Lemma |4.1| to the random variables 

2 



T, = - (r, - fi9^x,)Y + {Y, - r(r^x,))' 



i = 1, . . . , n. 



Note that these random variables are independent, identically distributed, 
and that 



n 

Ee { - f{9^X,) - r(r^X,)]' [fi9^X,) - f 

i=l 

n 

Ee { [2W, + r(r^X,) - f{9^X,)f [f{9^X,) - f 



i=l 



X, 



X, 



<Y,^{[AWf + {2C + If] [f{9^X,) - / 

i=l 

(since E[Wi|Xi] = 0). 
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Thus, by Assumption N, 

n n 

i=i 1=1 
<v, 

where we set 

V = 2n[(2C + If + V] [R{e, /) - R{e*, /*)] . 
More generally, for all integers A; > 3, 

n 
i=l 

n 

< {|2y, - fie^x,) - r(r^x,)|' |/(^^x,) - r(r^xo|'} 

i=l 

1=1 

n 

< 2*^-1 J] E I [2^=1 Wil'^ + (2C + 1)'=] (2C + If-' |/(^^Xi) - /* (r^X^ f } 



i=l 



In the last inequality, we used the fact that |a+6|*^ < 2*^-^(|a|'= + |6|'=) together 
with 

ifc 



l/(rx,)-r(^*^x,)f 
= l/(^^x,) - r(^*^x,)|'-' X |/(rx,) - rr^xop 

< (2C+l)'=-2|/(rX,)-r(r^X,)f. 
Therefore, by Assumption N, 

EE[(T,)y 

i=l 

< ^ [22*^-2A;!(7'L'=-2 + 2'=-^(2C + 1)'=] (2C + l)'^-^ [R{e, f) - R{d\ /*)] 

i=l 

[22^-2A;!(72L'=-2 + 2^-i(2C + 1)'=] (2C + l)*^-^ 

[(2C + l)2 + 4a2] 

S'^-^A;! max [L*^-^, (2C + l)'^-^] (2C + 1)^-^ 
_ v X ^ 
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with w = 8(2C + 1) max[L, 2C + 1]. 

Thus, for any inverse temperature parameter A E]0,n/w[, taking ( 



\/n, 



we may write by Lemma 4.1 

e| exp [A (Rie, f) - R{9\ D - /) + D)] } 

< exp 



2^2(1 - ^) 
Therefore, using the definition of v, we obtain 

A2 [(2C + 1)2 + 



E< exp 



A 



nil 



+ x{-R4ej) + R4e\n) -log 



{R{ej)-R{e\n) 



< 6. 



Next, we use a standard PAC-Bayesian approach (Catoni [12], [13], Audibert 
[1] and Alquier [T]). Let us remind the reader that vr is a prior probabihty 
measure on the set 5f__|_ x J^„(C + 1). We have 



E< exp 



+ A(-i?„(^,/) + i?„(r,r))-iog 



>dn{9,f)<6 



and consequently, using Fubini's theorem, 
A2 [(2C+l)2 + 4a2] 



^< /exp |A 

+ A(-i?„(e,/) + i?„(r,r))-iog 



iR{e,f)-R{d\n) 
1 



dniej)}<5. 



Therefore, for any data-dependent posterior probabihty measure p absolutely 
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continuous with respect to vr, adopting the convention oo x = 0, 

A2 [(2C + 1)2 + 4(t2] 



E<^ / exp 



A 



+ x{-R4e,f) + R4e\n) 



iRiej)-R{e\n) 



log 



dvr ' 



,/) -log 



dpiO, f) 



< S. 



RecaUing that P*^" stands for the distribution of the sample P„, the latter 
inequality can be more conveniently written as 



i,„^P®nE(^_^-)_^< exp 



A 



A2 [(2^+1)2 + 4^2 



n(l-^) 
+ \{-Rn{9,f) + Rn{9\n) - log 



dp ^ 
dn 



R{ej)-R{e\r, 



f) - lo? 



< 5. 



Thus, using the elementary inequality exp(Ax) > 1m^(x) we obtain, with 
probabihty at most 5, 



A[(2C + 1)2 + 4(t2 



nil - 



wX 



R{eJ)-R{e\r 



>Rn{eJ)-Rn{e\n + 



log^(^^,/) +logQ) 



A 



where the probability is evaluated with respect to the distribution P®" of the 
data Dn and the conditional probability measure p. Put differently, letting 



A e 



0, 



n 



^^;+ [(2C + 1)2 + 4ct2] 
we have, with probability at least 1 — 5, 

R{eJ)-R{e\n 



< 



Rn{ej)-Rn{e\n + 



logf§^(^,/)j+logQ) 



n—wX 



A 



This concludes the proof of Lemma 4.3 
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Lemma 4.4 Under the conditions of Lemma \4-3\ we have, with probability 
at least 1 — S, 



fl,.(9,/)d^(9, /)-«„(«*,/•) 



< 



+ 



/C(p,vr) + log(|) 



A 



Proof of Lemma 4.4 The beginning of the proof is similar to the one of 
Lemma 4.3 More precisely, we apply Lemma 4.1 with Tj = (Fj — /(^-^Xj))^ — 
{Yi — f*{6*^^i)y and obtain, for any inverse temperature parameter A G 
]0,n/w[, 

e| exp [A n - Rie, f) - n + /))] } 



< 



exp 



* n > 

Thus, using the definition of f , 

A2 \{2C + 1)2 + 4^2] 



E< exp 



A + 



nil - ^^^^ 

\ n > 



+ A(i?„(^,/)-i?„(r,r))-iog 



(i?(r,r)-i?(^,/)) 

r 



< b. 



Integrating with respect to vr leads to 

A2 [(2C + 1)2 + 4(7^] 



E< exp 



A + 



n(l-^) 
+ A(i?„(^,/)-i?„(r,r))-log 



1 



whence, by Fubini's theorem, 

A^ [(2C + l)2 + 4a2] 



exp 



A + 



n(l-f) 

+ A(i?„(^,/)-i?„(r,r))-iog 



(/?(r,r)-i?(^,/)) 
1 



d7r(^^,/)^<5. 



27 



Thus, for any data-dependent posterior probability measure p absolutely 
continuous with respect to tt, 



E 



+ \{Rn{dJ)-Rn{e\n) 



< 5. 



Therefore, by Jensen's inequaUty, 

A2 [(2C + 1)2 + 4a2] 



E< exp 



A + 



n(l - ^) 



{R{e*,n-R{9,f)) 



+ X{Rn{9J)-Rn{9\n) 
-log(|(.,/))-log(i 



E< cxp 



A + 



A2 [(2C + 1)2 + 4a' 



+ x(^j Rn{e,f)dp{e,f)-Rn{e\n 



dp{d,f) 

R{9\n- j R{e,f)dp{e,f) 



K, {p, tt) - log 



< S. 



Consequently, by the elementary inequality exp(Ax) > 1k_^(x), we obtain, 
with probability at most 6, 



R^(e,f)A~p(e,f)-R„{0',r) 



+ 



/C(p,7r)+log(i) 



A 
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Equivalently, with probability at least 1 — 5, 

R^iej)dpie,f)-R„ie\n 

/C(p,7r) + log(i) 



< 



+ 



A 



4.2 Proof of Theorem 2.1 



The proof starts with an application of Lemma 4.3 with p = px (the Gibbs 
distribution) as posterior distribution. More precisely, we know that, with 
probability larger than 1 — 5, 



R{e,,f\)-R{e\r 
1 

< - 



A[(2C+1)2+4ct2] 



1 - 



n—wX 



Rni^X, fx) — RniO*, f* 



+ 



log^(^A,/A) +l0g(i) 



A 



where the probability is evaluated with respect to the distribution P*^" of 
the data and the conditional probability measure p\. Observe that 



l0g(^(^A,/A))=l0g 



exp 


— XRniOx, fx) 




j exp[- 


-\R^{9,f)]d7i{9,f) 



\ 



= -XRniOx, fx) - log J exp [-Ai?„(e, /)] d7r(0, /). 
Consequently, with probability at least I — S, 

R{exJx)~R{e\f') 



< 



A 1- 



A[(2C+l)2+4a2] 
n—wX 



log / exp[-Ai?„(^,/)]d7r(^,/) 



A/?„(r,r) + log 
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Next, using Lemma [42] we deduce that, with probabihty at least 1 — 5, 



1 

< - 



^ _ A[(2C+l)2+4a2] ^ 



inf <^ / Rn{9,f)dp{9J)-Rn{9\r 



+ 



;C(p,vr) + logQ) 



A 



where the infimum is taken over all probability measures on iSf _|. x J^„(C + 1). 
In particular, letting Ai{I,M) be the set of all probability measures on 
5f _,_(/) X J^m{C + 1), we have, with probability at least I — S, 



R{exjx)-R{e\r 
1 

< 



n—wX 1 < M < n 



inf ^ inf R^{ej)Ap{ej)-R^{e\r 



+ 



/C(p,7r)+log(|) 



Next, observe that, for p G A^(/,M), 

/C(p, vr) = /C(p, p®i') = /C(p, p/ (g) z/m) + log 

(,?,) 



XO-I/I-A/ 



< k:{p,pi (g) z/Af) + log 



Xo-|/|-M 



(4.2) 



Therefore, with probability at least 1 — 5, 



i?(^A,/A)-i?(r,r 
1 

< 



inf 



inf 



1 A[(2C+l)2+4a2] peA^(7,A./) 
n—wX , ^ . r ^ 



1 < A/ < n 



/C(p, /ij ® Z^Af) + log 



+ 



10-|I|-A/ 



A 



R4ej)dpie,f)-Rr,{e\n 

(4.3) 
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By Lemma [44 and inequality (4.2), for any data-dependent distribution p G 
M), with probability at least 1 — 5, 



< 1 



Rr,{9,f)dp{9,f)-Rr,{9*,f 

^ A[(2C + l)^ + 4a^] 
n — w\ 

K:{p,fii (g) um) +log 
+ 



10- 



Ri9,f)dpi9,f)-R{9\r 
+ log(l) 



A 



(4.4) 



Thus, combining inequalities (4.3) and (4.4), we may write, with probability 
at least 1 — 26, 



R{9xJx)-R{9\n 
1 

< 



inf 



inf 



1 A[(2C+l)2+4<x2] peM{I,M) 



n—w\ 



1 < M < n 



1 + ^^^"^;^^^) 



}C{p,fii O um) + log 



+ 2 



10-|I|-Af 



+ log (I) 



A 



(4.5) 



For any subset / of {1, . . . ,p}, any positive integer M < n and any rj,^ & 
]0, 1/n], let the probability measure p/,Af,r?,7 be defined by 



with 



and 



dp/,M,r,,7(^,/) = dp),M,,,(^)dp/,Af,7(/)' 

'-{9) oc l[||e-e*,,||i<,,] 



dpi 



dpj 



,M,7 



di^M 



(/) oc 1[||/-/;,„i|a/<7] 



where, for / = Y.'jLi f^jVj ^ ^m{C + 1), we put 



M 
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With this notation, inequahty (4.5) leads to 

RiexJx)-Rie\n 

< 



1 < < n 

2 I /I ^21 



A[(2C + l)2 + 4a 
n — wX 

^ipi,M,v,y^l^i ® '^m) + log 



+ 2 



i?(^,/)dp,,Af,,,^(e,/)-i?(r,r 

+log(l) 



A 



(4.6) 



To finish the proof, we have to control the different terms in (4.6). Note first 
that 



log(,^,] <|/|log(^ 



and, consequently. 



log 



IQ-\I\-M 



< |/|log 



pe 



+ (|/| +M)loglO. 



Next, 



By technical Lemma |4.5[ we know that 

^(P/,Af,r,./^/) < (1^1 - 1) log ( max 



Similarly, by technical Lemma 4.6 



'C + 1 

J^{pLM,'y^^M) = Mloi ' 



7 



Putting all the pieces together, we are led to 



lC{pi,M,v,j^ Pi ® '^m) < (|/| - 1) log f max 



+ Mlog 



C + 1 

7 



Finally, it remains to control the term 

R{e,f)dpi,M,,,,{e,f). 



(4.7) 



(4.^ 
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To this aim, we write 



R{OJ)dpi 

,M,V,1 

^Je[{y- /;,m(CmX) + /;,m(CmX) - /(CmX) 
+ fief^x) - f{e^x)Y]dpj,M,rj,,{0, f) 

— -^(^7,M) /7,m) 

+ /e 



+ (/(^^i5;.x)-/rx))^ 

+ 2 (r - /;,m(CmX)) (/;,m(CmX) - /(CmX)) 
+ 2 (r - /;,m«mX)) (/(^fMX) - /(^x)) 

+ 2 (/;,m«mX) - /(^^5;.x)) (/«mX) - /(^^x))] dp,,M,,,,(^, /) 

m,M^ AV) + A + B + C + D + E. 
Computation of C By Fubini's theorem, 
C = E 



= E< 



/ 2 (r - /;,m(CmX)) (/;,m(CmX) - /(CmX)) dp,,M,,,,(^, /) 
2 {Y - /;,m(CmX)) 



X 



/ (/;,m(CmX) - /«mX)) dp^,M,,(/) 



By the triangle inequality, for / = Y^f^i^jVj ^^d f^^ = Y.f=M,M)jVj, it 
holds 

M A7 M 

j=i j=i j=i 

Since //j^ G -FmIC), we have Ej^iil(/5/,M)il < C*. so that E^iJ'l/^jl < 
C + 1 as soon as ||/ — //,mIIm < 1- This shows that the set 

M 

f = ^Pj^3- ii/-/;,mIIm <7 



^M 
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is contained in the support of vm- In particular, this imphes that pj^M,j 
centered at ff j^ and, consequently. 



(/;,m(CmX) - /(CmX)) dpl^M) = 0- 
This proves that C = 0. 

Control of A Clearly, 

A < /sup {{flM - /(?/))' dp^,M,-y(/) < 7'- 



Control of B We have 



E 



(by the mean value theorem) 

<f{c + ifE [\\x\\l] J pi^ - ewldpi^^^ie) 
<^2(c + i)Y 

(by Assumption D). 



Control of E Write 
|E| < 



2 / e[|av«mX)-/«mX)| 

x\f{9^l,X)-f{e^X)\\dpi,M,,,,{9,f) 

>< e{c + 1) K^f^ - r)x|ldp,,M,,,,(^, /) 



< 2 / E 



< 2 



E 



E 



3*T 



r)x 



_(£(c + i)(^^; 

(by the Cauchy-Schwarz inequality) 

<2(72)^(£2(C + 1)V)' 
= 2^(C+ 1)777. 



dpi,M,r],- 



If) 
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Control of D Finally, 



D = 2 y E [(F - /;,m(CmX)) (/«mX) - /(^X))] dp,,M,,,,(^, /) 
(since ^ fdpl^^^if) = Z/^^^) 

- /;,m«mX)) / (/;,m«mX) - /;,m(^^x)) dpi,^,,!^) 



= 2E 



<2./e {Y-fiuo^^^)y 



X We 



(by the Cauchy-Schwarz inequality) 



The inequality 



leads to 



|/;,m(CmX) - /.vrx)! < i{c + i) \{ef^ - Ox| 

<£(C+l)||^U-^l|i 



(/;,m(CmX) - /;,m(^^x)) dp),^,^(^) 



<^'(C + l)' 



Consequently, 

(/;,M(^fMX) - /;,m(^^x)) dpiM,,{6) 



n 2 



<r(c + i)V, 



and therefore 



D < 21{C + l)77Vi?(0,0)/2 

< V2^(C+l)77VC2 + (72. 
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Thus, taking r] = •j = 1/n and putting all the pieces together, we obtain 

A + B + C + D + E<— , 

n 

where Si is a positive constant, function of C, a and i. Combining this 



inequality with (4.6)-(4.8) yields, with probability larger than 1 — 26, 



R{ex,fx)-R{e\n 



n—wX 1 < M < n 



n I A 

Choosing finally 

A 



Mlog(10(C + l)n) + \I\ log(40epn) + log (|) 



w + 2[(2C + 1)2 + 4(t2]' 

we obtain that there exists a positive constant S2, function of L, C, a and 
such that, with probability at least 1 — 26, 



R{9x, h) - R{e\ n < S2 inf ^ <^ R{ei,„ fi,,) - R{e*, n 

IC{l,...,p} 
1 < M < n 



^ Mlog(lOCn) + |/| log(40epra) + log (|) 



n 



This concludes the proof of Theorem 2.1 



4.3 Proof of Corollary 2.1 



We already know, by Theorem |2.1[ that with probability at least 1 — 6, 



R{e,,h)-R{e\n<E inf { R{ei,„ fi^,) - R{e\ r 

IC{l,...,p} 
1 < M < n 



^ Mlog(Cn) + |/| log(pn) + log (I) 



n 



By definition, for all {9, f) G 5f + (J) x J^m{C), 

R{eiMJlM)<R{oj). 
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In particular, if I* is such that 9* G iSf _,_(/*), then 



RiOx, fx) - Rie\ n < E inf <^ R{e\ f) - R{e\ r 

1 < M < n 



+ 



M\og{Cn) + \P\ \og{pn) + log (|) 



n 



Observe that, for any / G J^m{C), 

R{e\f)-R{e\n= I [/(r^x)-r(r^x)]'dP(x,2/) 

Jm> 

< Jjf (t) - r it)f dt. 

Since /* G L2 ([— 1, 1]), we may write 



and apply (4.9) with 



M 



In order to do so, we just need to check that / G J-'m{C), that is 



M 



i=i 

But, by the Cauchy-Schwarz inequality, 

M M 

i=i i=i 



< 



2-2fe 
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Thus, 



M 



Urn 



< 



TT 



M 



*\2 
3 ) 



(since, by assumption, k >2) 

< c 

(since r e W{k, 60^71^)). 



Next, with this choice of /, 



[/(t)-r (t)]'dt< AM 



-2k 



for some positive constant A depending only on k and C (see for instance 



Tsybakov [42j). Therefore, inequahty (4.9) leads to 



R{9x,fx)-R{6\n<E mi {AM 

KAKn 



-2k 



+ 



M\og{Cn) + \r\ \og{pn) + log (|) 



n 



(4.10) 



Letting [.] be the ceiling function and choosing M = [(n/ log(Cn)) 2/3+1] in 



(4.10) concludes the proof. 



4.4 Some technical lemmas 

Lemma 4.5 For any subset I of {1, . . . ,p}, any positive integer M <n and 
any rj g]0, l/n], let the probability measure p]^M,r] defined by 



Then 



(6) oc ||i<,7]- 



^(P/,Af,r,>^/) < (1^1 - l)log (^max 



Proof. For simplicity, we assume that / = {1, . . . , |/|}. Up to a permuta- 
tion of the coordinates, the proof remains valid for any subset / of {1, . . . ,p}. 
Still for simplicity, we let 9 denote O^j m- By a symmetry argument, it can 
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be assumed that 9 has nonnegative coordinates — ^this just means that 9 is 
arbitrarily fixed in one of the 2l^l-i faces of 5f_+(7). We denote by TA this 
face and note that 



1^1 



Finally, without loss of generality, we suppose that the largest coordinate in 
^ is ^1, and let u be the uniform probability measure on TA, defined by 



dz/ 



{9) = 2\'^-'lieeTA]- 



Set u — min(l/|/|, 77/2), and let 



T2 ={9i-u,92 + u,93,. 
Ts ={9,-u,92,93 + u,. 



.,%l,0,...,0), 
. , 6'|/|, 0, . . . , 0), 



T\i\ = {9i-u,92,93,...,9\i\+u,0,...,0). 

Note that u < 1/|/| < 9-i. Therefore, for each j, all the coordinates of Tj are 
nonnegative. Obviously ||Tj||i = 1, so that, for all j, Tj e J-'A. Denoting by 
K the convex hull of the set {9,T2, . . . ,T\j\}, we also have K C TA. Next, 
observe that ||T,- = 2m < 77, which implies K c{9 eW : \\9-9\\i < 7]}. 



Clearly, 



< log 



h\\0-Ol^\U<r,](iM&) 



-[e&TA]'i-[\\e-e}j^\\,<ri]d^ii{9) 



39 



Thus, 



^p],M,r,, f^l) < log 



< log 



2|/|-i 



( \ 

2|/|-i 



/ 



Observe that K is homothetic to TA., by a factor of u. This means that 



Consequently, we obtain 



^(P/,M,,7'/^^) ^ log 



U 



< (|/| — 1) log ^max 



1^1, 



Lemma 4.6 For any subset I of {1, ... ,p}, any positive integer M <n 
any 7 e ]0, l/n], let the probability measure pj m-y defined by 



dpi 



M,7 



if) ^ 1[||/-/;,mIIm<7] 



where, for f = J^fLi /^j^j ^ J^m{C + 1), we put 



M 



Then 



^(P7,m,7'^m) = Mlog 



C + 1 
7 



Proof. Observe that 



dpi 



(/) dpi^,^(/). 
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Now, 



where C — J '^IWf-fi M\\M<'i]if)^^'^if)- easily follows, using the fact that 
the support of P/,m,7 is included in the set {/ e Tm{C+1) : ||/ — //^mII ^ t}) 



that 



Note that 



^{p],m,v^m) =log(l/C). 

C = / i[ii/-/;,MiiM<7](/)d'^M(/) 

^E*iiil/3.-(/3?,M)il<T](^)l[E'liJl^.l<C+l](^)d^ 



/ 



■[E^ij|^jl<c+i] 



(/3)d/3 



where the second equality is true since um is (the image of) the uniform 
probability measure on e : E^i il/^il < + 1}- This imphes 



/ 



By the triangle inequality, 

M M 



M 



Mjj\ 



Since ff ^ e j^m(C), we have E^i il(/^7,M)jl < C, so that 



^[E,1iil/3.-|<c+i] - ^[E,1iil/3,-(/3|,M).l<7l 



as soon as 7 < 1. We conclude that 

/ 

^(P/,M,7' ^m) = log 



/ 



\J ^[E,"liil/3.-(/3|,M).l<7]^^y 



Mlog 



C + 1 

7 
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5 Annex: Description of the MCMC algo- 
rithm 



This annex is intended to make thoroughly clear the specification of the 
proposal conditional densities ki and ^2 introduced in Section |3| 

5.1 Notation 

To provide explicit formulas for the conditional densities fci((r, h)\{9, /)) and 
/c2((t, h)\{9,f)), we first set 



mf 



f = J2Pf,m and h = J2Ph,m, 

j=i i=i 

where it is recalled that {ipj}'jLi denotes the (non-normalized) trigonometric 
system. We let / (respectively, J) be the set of nonzero coordinates of the 
vector 9 (respectively, r), and denote finally by 9i (respectively, tj) the vector 
of dimension |J| (respectively, \J\) which contains the nonzero coordinates of 
9 (respectively, |r|). Recall that all densities are defined with respect to the 



prior TT, which is made explicit in Subsection 2.2 



For a generic h G J^m^iC + 1), given r G iSf_,_ and s > 0, we let the density 
denSs{h\T,mh) with respect to vr be defined by 



dens^(/i|r, rrih) 



oc exp 



1 / ~ \' 



.i=i 



where the /3j(r, m^) are the empirical least square coefficients given by 



/3j {T,mh)\ G arg min J^i'^^'Yl ^^^^ 



In the experiments, we fixed s = 0.1. Note that simulating with respect to 
dens5(/i|r, rrih) is an easy task, since one just needs to compute a least square 
estimate and then draw from a truncated Gaussian distribution. 
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5.2 Description of ki 

We take 



3 -Lrl=i] 

, k,^_{.\{e,f)) + 2h,= {-\{e,f)) + h,^{-\{e,f)) ^ 

+ 1 i[i<i-fi<p] 

h,^{.\{ej)) + 2h,= i-\{ej)) 

H ^ H\i\=p]- 

Roughly, the idea is that tries to remove one component in 9, ki- keeps 
the same number of components, whereas adds one component. The 
density ki- takes the form 

h,= i{T,h)\{0,f)) = A;i,=(T|^)dens,(/i|T,m^). 

The density ki-{.\9) is the density of r when J — I and 

01 + E 



\\ei + E\ 



-sgn {(61 + E)j(^0^+E)) , 



where E = {Ei, . . . ,E\j\) and the E^ are independent random variables uni- 
formly distributed in [—5,5]. Throughout, the value of S was fixed at 0.5. 
It is noteworthy that when we change the parameter from 6 to r, then we 
also change the function from f to h. Thus, with this procedure, the link 
function h is more "adapted" to r and the subsequent move is more likely to 
be accepted in the Hastings-Metropolis algorithm. 

In the case where we are to remove one component, A;i^_ is given by 
ki- {{T,h)\{e,f)) = ^Cjl[r=0_j]denSs{h\T,mf), 

where is just obtained from 9 by setting the j-th component to and by 
renormalizing the parameter in order to have = 1. We set 



exp(-|gj|)l[|e^.|<5] 
E^6/exp(-|^^|) l[\e,\<s\' 



The idea is that smaller components are more likely to be removed than 
larger ones. Finally, the density takes the form 

^1,+ ((r, h)\{e, /)) = c;i[,_^.=,] ^^dens,(/i|r, m;). 
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We set 

exp(|Er=i (11 -/rx,)) (X,),|) 



where (Xj)j denotes the j-th component of Xj. In words, the idea is that a 
new nonzero coordinate in 9 is more hkely to be interesting in the model if 
the corresponding feature is correlated with the current residual. 

5.3 Description of k2 

In the same spirit, we let the conditional density /c2 be defined by 



f^2 {■\[9:J)) = l[m/=l] 



3 

k,,.{-\{9j))+2h,= mf)) + k2,+ mf)) , 

-r . J-[l<my<n] 



A;2,-(-|(g,/)) + 2A;2,= (-|(g,/)) 



We choose 
and 



4 

3 

k2,= {{T, h)\{9,f )) = lir=0]denSs{h\T,mf) 



k2,+ {{t, h)\{9,f )) = 1 [^=9] dens,, (/i I r,m/ + 1). 

With this choice, nih = m/ + l, which means that the proposal density tries to 
add one coefficient in the expansion of h, while leaving 9 unchanged. Finally 

k2- ((r,/i)|(6',/)) = l[r=0]denSs{h\T,mf - 1), 

and the proposal tries to remove one coefficient in h. 
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